1 Introduction

We’ll use the Credit dataframe from the ISLR package to demonstrate multiple regression with:

  1. A numerical outcome variable \(y\), in this case credit card balance.
  2. Two explanatory variables:
    • A first numerical explanatory variable \(x_1\). In this case, their credit limit.
    • A second numerical explanatory variable \(x_2\). In this case, their income (in thousands of dollars).

2 Exploratory data analysis

library(ISLR)
data(Credit)
#View(Credit)
glimpse(Credit)
## Observations: 400
## Variables: 12
## $ ID        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
## $ Income    <dbl> 14.891, 106.025, 104.593, 148.924, 55.882, 80.180, 20.…
## $ Limit     <int> 3606, 6645, 7075, 9504, 4897, 8047, 3388, 7114, 3300, …
## $ Rating    <int> 283, 483, 514, 681, 357, 569, 259, 512, 266, 491, 589,…
## $ Cards     <int> 2, 3, 4, 3, 2, 4, 2, 2, 5, 3, 4, 3, 1, 1, 2, 3, 3, 3, …
## $ Age       <int> 34, 82, 71, 36, 68, 77, 37, 87, 66, 41, 30, 64, 57, 49…
## $ Education <int> 11, 15, 11, 11, 16, 10, 12, 9, 13, 19, 14, 16, 7, 9, 1…
## $ Gender    <fct>  Male, Female,  Male, Female,  Male,  Male, Female,  M…
## $ Student   <fct> No, Yes, No, No, No, No, No, No, No, Yes, No, No, No, …
## $ Married   <fct> Yes, Yes, No, No, Yes, No, No, No, No, Yes, Yes, No, Y…
## $ Ethnicity <fct> Caucasian, Asian, Asian, Asian, Caucasian, Caucasian, …
## $ Balance   <int> 333, 903, 580, 964, 331, 1151, 203, 872, 279, 1350, 14…
credit = Credit %>% select(Balance, Limit, Income, Rating, Age)
glimpse(credit)
## Observations: 400
## Variables: 5
## $ Balance <int> 333, 903, 580, 964, 331, 1151, 203, 872, 279, 1350, 1407…
## $ Limit   <int> 3606, 6645, 7075, 9504, 4897, 8047, 3388, 7114, 3300, 68…
## $ Income  <dbl> 14.891, 106.025, 104.593, 148.924, 55.882, 80.180, 20.99…
## $ Rating  <int> 283, 483, 514, 681, 357, 569, 259, 512, 266, 491, 589, 1…
## $ Age     <int> 34, 82, 71, 36, 68, 77, 37, 87, 66, 41, 30, 64, 57, 49, …

Let’s look at some summary statistics for the variables that we need for the problem at hand.

library(skimr)
## 
## Attaching package: 'skimr'
## The following object is masked from 'package:stats':
## 
##     filter
library(ggplot2)
library(cowplot)
## 
## ********************************************************
## Note: As of version 1.0.0, cowplot does not change the
##   default ggplot2 theme anymore. To recover the previous
##   behavior, execute:
##   theme_set(theme_cowplot())
## ********************************************************
credit %>% skim()
## Skim summary statistics
##  n obs: 400 
##  n variables: 5 
## 
## ── Variable type:integer ─────────────────────────────────────────────────────────────────────────────────────
##  variable missing complete   n    mean      sd  p0     p25    p50     p75
##       Age       0      400 400   55.67   17.25  23   41.75   56     70   
##   Balance       0      400 400  520.01  459.76   0   68.75  459.5  863   
##     Limit       0      400 400 4735.6  2308.2  855 3088    4622.5 5872.75
##    Rating       0      400 400  354.94  154.72  93  247.25  344    437.25
##   p100     hist
##     98 ▅▆▇▆▆▆▃▁
##   1999 ▇▃▃▃▂▁▁▁
##  13913 ▅▇▇▃▂▁▁▁
##    982 ▅▇▇▅▂▁▁▁
## 
## ── Variable type:numeric ─────────────────────────────────────────────────────────────────────────────────────
##  variable missing complete   n  mean    sd    p0   p25   p50   p75   p100
##    Income       0      400 400 45.22 35.24 10.35 21.01 33.12 57.47 186.63
##      hist
##  ▇▃▂▁▁▁▁▁
pAge = ggplot(data = credit, mapping = aes(x = Age)) +
  geom_histogram(binwidth = 10, fill = 'red2', color = 'red4')

pBalance = ggplot(data = credit, mapping = aes(x = Balance)) +
  geom_histogram(binwidth = 200, fill = 'blue2', color = 'blue4')

pLimit = ggplot(data = credit, mapping = aes(x = Limit)) +
  geom_histogram(binwidth = 1000, fill = 'yellow2', color = 'yellow4')

pRating = ggplot(data = credit, mapping = aes(x = Rating)) +
  geom_histogram(binwidth = 100, fill = 'green2', color = 'green4')

plot_grid(pAge, pBalance, pLimit, pRating)

Let’s also look at histograms as visual aids.

We observe for example:

Since our outcome variable Balance and the explanatory variables Limit and Income are numerical, we can and have to compute the correlation coefficient between pairs of these variables before we proceed to build a model.

credit %>%
  select(Balance, Limit, Income) %>% 
  cor()
##           Balance     Limit    Income
## Balance 1.0000000 0.8616973 0.4636565
## Limit   0.8616973 1.0000000 0.7920883
## Income  0.4636565 0.7920883 1.0000000

Note: Collinearity (or multicollinearity) is a phenomenon in which one explanatory variable in a multiple regression model can be linearly predicted from the others with a substantial degree of accuracy. So in this case, if we knew someone’s credit card Limit and since Limit and Income are highly correlated, we could make a fairly accurate guess as to that person’s Income. Or put loosely, these two variables provided redundant information. For now let’s ignore any issues related to collinearity and press on.

Let’s visualize the relationship of the outcome variable with each of the two explanatory variables in two separate plots:

To get a sense of the joint relationship of all three variables simultaneously through a visualization, let’s display the data in a 3-dimensional (3D) scatterplot, where

  1. The numerical outcome variable \(y\) Balance is on the \(z\)-axis (vertical axis)
  2. The two numerical explanatory variables form the “floor” axes. In this case
    • The first numerical explanatory variable \(x_1\), Income is on of the floor axes.
    • The second numerical explanatory variable \(x_2\), Limit is on the other floor axis.
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
p <- plot_ly(data = Credit, z = ~Balance, x = ~Income, y = ~Limit, opacity = 0.6, color = Credit$Balance) %>%
  add_markers() 
p

2.0.1 Exercise

Conduct a new exploratory data analysis with the same outcome variable \(y\) being Balance but with Rating and Age as the new explanatory variables \(x_1\) and \(x_2\). Remember, this involves three things:

  1. Looking at the raw values
  2. Computing summary statistics of the variables of interest.
    • Half of cardholders are over the age of 56.
    • 25% of cardholders have a credit rating of under 247.25
    • Only a quarter of cardholders are under 42 years old
    • 25% of cardholders are over 70 years old
    • 25% of cardholders have a credit rating over 437.25
    • The youngest credit card holder is 23 years old
    • The oldest credit card holder is 98 years old
    • The lowest credit card rating is 93
    • The highest credit card rating is 982
  3. Creating informative visualizations

What can you say about the relationship between a credit card holder’s balance and their credit rating and age?

credit %>%
  select(Balance, Rating, Age) %>% 
  cor()
##             Balance    Rating         Age
## Balance 1.000000000 0.8636252 0.001835119
## Rating  0.863625161 1.0000000 0.103164996
## Age     0.001835119 0.1031650 1.000000000
p <- plot_ly(data = Credit, z = ~Balance, x = ~Rating, y = ~Age, opacity = 0.6, color = Credit$Balance) %>%
    add_markers() 
p

Balance with Limit is 0.862. This indicates a strong positive linear relationship, which makes sense as only individuals with large credit limits can accrue large credit card balances. Balance with Income is 0.464. This is suggestive of another positive linear relationship, although not as strong as the relationship between Balance and Limit. As an added bonus, we can read off the correlation coefficient of the two explanatory variables, Limit and Income of 0.792. In this case, we say there is a high degree of collinearity between these two explanatory variables.

2.1 Multiple regression

We now use a + to consider multiple explanatory variables. Here is the syntax:

model_name <- lm(y ~ x1 + x2 + ... +xn, data = data_frame_name)
Balance_model <- lm(Balance ~ Limit + Income, data = Credit)
Balance_model
## 
## Call:
## lm(formula = Balance ~ Limit + Income, data = Credit)
## 
## Coefficients:
## (Intercept)        Limit       Income  
##   -385.1793       0.2643      -7.6633
# Or use one of the followings to see more info...
library(moderndive)
get_regression_table(Balance_model)
## # A tibble: 3 x 7
##   term      estimate std_error statistic p_value lower_ci upper_ci
##   <chr>        <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept -385.       19.5       -19.8       0 -423.    -347.   
## 2 Limit        0.264     0.006      45.0       0    0.253    0.276
## 3 Income      -7.66      0.385     -19.9       0   -8.42    -6.91
#summary(Balance_model)

Model: \(\hat{Balance}= -385 + 0.264 \cdot Limit - 7.66 \cdot Income\)

How do we interpret these three values that define the regression plane?

  • Intercept: \(-\$385.18\) (rounded to two decimal points to represent cents). The intercept in our case represents the credit card balance for an individual who has both a credit Limit of \(\$0\) and Income of \(\$0\).
    • In our data however, the intercept has limited (or no) practical interpretation as ….
  • Limit: \(\$0.26\). Now that we have multiple variables to consider, we have to add a caveat to our interpretation: Holding all the other variables fixed (Income, in this case), for every increase of one unit in credit Limit (dollars), there is an associated increase of on average \(\$0.26\) in credit card balance.
  • Income: \(-\$7.66\). Similarly, Holding all the other variables fixed (Limit, in this case), for every increase of one unit in Income (in other words, \(\$10000\) in income), there is an associated decrease of on average \(\$7.66\) in credit card balance.

2.2 Observed/fitted values and residuals

As we did previously, let’s look at the fitted values and residuals.

regression_points <- get_regression_points(Balance_model)
regression_points
## # A tibble: 400 x 6
##       ID Balance Limit Income Balance_hat residual
##    <int>   <int> <int>  <dbl>       <dbl>    <dbl>
##  1     1     333  3606   14.9        454.   -121. 
##  2     2     903  6645  106.         559.    344. 
##  3     3     580  7075  105.         683.   -103. 
##  4     4     964  9504  149.         986.    -21.7
##  5     5     331  4897   55.9        481.   -150. 
##  6     6    1151  8047   80.2       1127.     23.6
##  7     7     203  3388   21.0        349.   -146. 
##  8     8     872  7114   71.4        948.    -76.0
##  9     9     279  3300   15.1        371.    -92.2
## 10    10    1350  6819   71.1        873.    477. 
## # … with 390 more rows

2.2.1 Diagnostics (Residual plot)

ggplot(Balance_model, aes(x = .fitted, y = .resid)) + geom_point()

2.2.2 Making predictions

Assuming the model is good…

Kevin has a credit limit of $5080 and his income is $150,000. Use the Balance_model to predict Kevin’s balance.

newx <- data.frame(Limit = _____, Income = ____)

predicted_balance <- predict(Balance_model, newx)
predicted_balance
newx = data.frame(Limit = c(5080,4090), Income = c(15,30))
predicted_balance = predict(Balance_model, newx)
predicted_balance
##        1        2 
## 842.6244 465.9962

2.3 One numerical & one categorical explanatory variable

Let’s revisit the instructor evaluation data introduced in Ch 6.

Consider a modeling scenario with:

  1. A numerical outcome variable \(y\). As before, instructor evaluation score.
  2. Two explanatory variables:
    1. A numerical explanatory variable \(x_1\): in this case, their age.
    2. A categorical explanatory variable \(x_2\): in this case, their binary gender.

2.3.1 Exploratory data analysis

Let’s reload the evals data and select() only the needed subset of variables. Note that these are different than the variables chosen in Chapter 6. Let’s given this the name evals_ch7.

  1. Let’s look at the raw data values both by using View() and the glimpse() functions.
  2. Let’s look at some summary statistics using the skim() function from the skimr package:
library(moderndive)
data(evals)

evals_ch7 = evals %>% select(age, gender, score)

glimpse(evals_ch7)
## Observations: 463
## Variables: 3
## $ age    <int> 36, 36, 36, 36, 59, 59, 59, 51, 51, 40, 40, 40, 40, 40, 4…
## $ gender <fct> female, female, female, female, male, male, male, male, m…
## $ score  <dbl> 4.7, 4.1, 3.9, 4.8, 4.6, 4.3, 2.8, 4.1, 3.4, 4.5, 3.8, 4.…
evals_ch7 %>% skim()
## Skim summary statistics
##  n obs: 463 
##  n variables: 3 
## 
## ── Variable type:factor ──────────────────────────────────────────────────────────────────────────────────────
##  variable missing complete   n n_unique                top_counts ordered
##    gender       0      463 463        2 mal: 268, fem: 195, NA: 0   FALSE
## 
## ── Variable type:integer ─────────────────────────────────────────────────────────────────────────────────────
##  variable missing complete   n  mean  sd p0 p25 p50 p75 p100     hist
##       age       0      463 463 48.37 9.8 29  42  48  57   73 ▅▅▅▇▅▇▂▁
## 
## ── Variable type:numeric ─────────────────────────────────────────────────────────────────────────────────────
##  variable missing complete   n mean   sd  p0 p25 p50 p75 p100     hist
##     score       0      463 463 4.17 0.54 2.3 3.8 4.3 4.6    5 ▁▁▂▃▅▇▇▆
  1. Let’s compute the correlation between two numerical variables we have score and age. Recall that correlation coefficients only exist between numerical variables.
evals_ch7 %>% 
  get_correlation(formula = score ~ age)
## # A tibble: 1 x 1
##   correlation
##         <dbl>
## 1      -0.107

We observe that the age and the score are weakly and negatively correlated.

Now, let’s try to visualize the correlation.

  • Create a scatterplot of score over age. Use the binary gender variable to color the point with two colors. Add a regression line (or two) in to your scatterplot.
    • Say a couple of interesting things about the graph you’ve created.
ggplot(data = evals_ch7, mapping = aes(x = age, y = score, color = gender)) + 
    geom_point(alpha = 0.5) +
    labs(x = "Age", y = "Teaching Score", title = "Using method = lm") +
    geom_smooth(method = "lm", se = FALSE)

  • Females experience a more severe decline in teaching score with age
  • Many of the poor teaching scores for men come from men over 55, while many of the poor teaching scores for women come from women between 35 and 55

2.3.2 Multiple regression: Parallel slopes model

Much like we started to consider multiple explanatory variables using the + sign in the previous section, let’s fit a regression model and get the regression table.

score_model_ch7 = lm(score ~ age + gender, data = evals_ch7)
score_model_ch7
## 
## Call:
## lm(formula = score ~ age + gender, data = evals_ch7)
## 
## Coefficients:
## (Intercept)          age   gendermale  
##    4.484116    -0.008678     0.190571
#get_regression_table(score_model_ch7)

Full: \(\hat{Score} = 4.48 - 0.009 \cdot age + 0.191 \cdot 1_{Male}(x)\)

Male: \(\hat{Score_M} = 4.671 - 0.009 \cdot age\)

Female: \(\hat{Score_F} = 4.48 - 0.009 \cdot age\)

Let’s create the scatterplot of score over age again. Use the binary gender variable to color the point with two colors. Add a regression lines in to your scatterplot but use the model(s) we created.

ggplot(data = evals_ch7, mapping = aes(x = age, y = score, color = gender)) + 
    geom_point(alpha = 0.5) +
    labs(x = "Age", y = "Teaching Score", title = "Using Parallel Slopes") +
    geom_abline(intercept = 4.48, slope = -0.009, color = "tomato", lwd=1) +
    geom_abline(intercept = 4.671, slope = -0.009, color = "mediumturquoise", lwd=1)

Interpretaions of the coefficients:

  • \(b_{male}=0.1906\) is the average difference in teaching score that men get relative to the baseline of women.
  • Accordingly, the intercepts (which in this case make no sense since no instructor can have an age of 0) are :
    • for women: \(b_0=4.484\)
    • for men: \(b_0+b_{male}=4.484+0.191=4.675\)
  • Both men and women have the same slope. In other words, in this model the associated effect of age is the same for men and women. So for every increase of one year in age, there is on average an associated decrease of \(b_{age}=−0.009\) in teaching score.

2.3.3 Multiple Regression: Interaction Model

We say a model has an interaction effect if the associated effect of one variable depends on the value of another variable. These types of models usually prove to be tricky to view on first glance because of their complexity. In this case, the effect of age will depend on the value of gender. (as was suggested by the different slopes for men and women in our visual exploratory data analysis)

Let’s fit a regression with an interaction term. We add an interaction term using the * sign. Let’s fit this regression and save it in score_model_interaction, then we get the regression table using the get_regression_table() function as before.

score_model_interaction <- lm(score ~ age + gender + age * gender, data = evals_ch7)
get_regression_table(score_model_interaction)
## # A tibble: 4 x 7
##   term           estimate std_error statistic p_value lower_ci upper_ci
##   <chr>             <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept         4.88      0.205     23.8    0        4.48     5.29 
## 2 age              -0.018     0.004     -3.92   0       -0.026   -0.009
## 3 gendermale       -0.446     0.265     -1.68   0.094   -0.968    0.076
## 4 age:gendermale    0.014     0.006      2.45   0.015    0.003    0.024

The modeling equation for this scenario is (Writing the equation):

\(\hat{y} = b_0 + b_1 \cdot x_1 + b_2 \cdot x_2 + b_3 \cdot x_1 \cdot x_2\)

\(\hat{score} = 4.883 - 0.018 \cdot age - 0.446 \cdot 1_{Male}(x) + 0.014 \cdot age \cdot 1_{Male}(x)\)

The model for male:

\(\hat{score} = 4.883 - 0.018 \cdot age - 0.446 \cdot 1 + 0.014 \cdot age \cdot 1\)

\(\hat{score} = 4.437 - 0.004 \cdot age\)

The model for female:

\(\hat{score} = 4.883 - 0.018 \cdot age - 0.446 \cdot 0 + 0.014 \cdot age \cdot 0\)

\(\hat{score} = 4.883 - 0.018 \cdot age\)

We see that while male instructors have a lower intercept, as they age, they have a less steep associated average decrease in teaching scores: 0.004 teaching score units per year as opposed to -0.018 for women. This is consistent with the different slopes and intercepts of the red and blue regression lines fit in the original scatter plot.

ggplot(data = evals_ch7, mapping = aes(x = age, y = score, color = gender)) + 
    geom_point(alpha = 0.5) +
    labs(x = "Age", y = "Teaching Score", title = "Using Parallel Slopes") +
    geom_abline(intercept = 4.883, slope = -0.018, color = "tomato", lwd=1) +
    geom_abline(intercept = 4.437, slope = -0.004, color = "mediumturquoise", lwd=1)

2.3.4 Observed/fitted values and residuals

regression_points <- get_regression_points(score_model_interaction)
regression_points
## # A tibble: 463 x 6
##       ID score   age gender score_hat residual
##    <int> <dbl> <int> <fct>      <dbl>    <dbl>
##  1     1   4.7    36 female      4.25    0.448
##  2     2   4.1    36 female      4.25   -0.152
##  3     3   3.9    36 female      4.25   -0.352
##  4     4   4.8    36 female      4.25    0.548
##  5     5   4.6    59 male        4.20    0.399
##  6     6   4.3    59 male        4.20    0.099
##  7     7   2.8    59 male        4.20   -1.40 
##  8     8   4.1    51 male        4.23   -0.133
##  9     9   3.4    51 male        4.23   -0.833
## 10    10   4.5    40 female      4.18    0.318
## # … with 453 more rows
ggplot(score_model_interaction, aes(x = .fitted, y = .resid)) + geom_point()